Background

The coursework involves an individual analysis of two datasets: crime23.csv and temp2023.csv. These datasets cover street-level crime incidents and daily climate data for Colchester in 2023. The crime data was extracted using an interface whose variable descriptions are available at https://ukpolice.njtierney.com/reference/ukp_crime.html. Similarly, the climate data was collected from a weather station near Colchester; the variable descriptions and extraction interface are documented at https://bczernecki.github.io/climate/reference/meteo_ogimet.html.

Task and Aim:

The objective is to conduct a comprehensive analysis of both datasets. The aim is to explore patterns, trends and relationships within the crime and climate data, and to gain insight into the factors that may influence when and where crimes occur.

Methodology

The analysis will include descriptive statistics, data visualisation, and correlation analysis to uncover patterns and relations within the datasets. Advanced graphics and interactive plots will be utilised to enhance the presentation of the findings. The project will be carried out using R Markdown, providing a detailed interpretation of the results.

# Load libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1     ✔ tibble  3.2.1
## ✔ purrr   1.0.2     ✔ tidyr   1.3.1
## ✔ readr   2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(leaflet)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
## 
## Attaching package: 'ggmap'
## 
## 
## The following object is masked from 'package:plotly':
## 
##     wind
# Load the dataset
colc_crime_data <- read_csv("crime23.csv")
## Rows: 6878 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): category, persistent_id, date, street_name, location_type, location...
## dbl (4): lat, long, street_id, id
## lgl (1): context
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the structure of the data frame
str(colc_crime_data)
## spc_tbl_ [6,878 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ category        : chr [1:6878] "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ persistent_id   : chr [1:6878] NA NA NA NA ...
##  $ date            : chr [1:6878] "2023-01" "2023-01" "2023-01" "2023-01" ...
##  $ lat             : num [1:6878] 51.9 51.9 51.9 51.9 51.9 ...
##  $ long            : num [1:6878] 0.909 0.902 0.898 0.902 0.895 ...
##  $ street_id       : num [1:6878] 2153366 2153173 2153077 2153186 2153012 ...
##  $ street_name     : chr [1:6878] "On or near Military Road" "On or near" "On or near Culver Street West" "On or near Ryegate Road" ...
##  $ context         : logi [1:6878] NA NA NA NA NA NA ...
##  $ id              : num [1:6878] 1.08e+08 1.08e+08 1.08e+08 1.08e+08 1.08e+08 ...
##  $ location_type   : chr [1:6878] "Force" "Force" "Force" "Force" ...
##  $ location_subtype: chr [1:6878] NA NA NA NA ...
##  $ outcome_status  : chr [1:6878] NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   category = col_character(),
##   ..   persistent_id = col_character(),
##   ..   date = col_character(),
##   ..   lat = col_double(),
##   ..   long = col_double(),
##   ..   street_id = col_double(),
##   ..   street_name = col_character(),
##   ..   context = col_logical(),
##   ..   id = col_double(),
##   ..   location_type = col_character(),
##   ..   location_subtype = col_character(),
##   ..   outcome_status = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

The dataset has 6,878 observations and 12 attributes. Here’s a look into the columns:

category is a character vector indicating the category of each crime reported.

persistent_id is a character vector representing the persistent ID for each crime. It contains several missing values (NA).

date is a character vector representing the date of each crime reported, formatted as YYYY-MM.

lat is a numeric vector representing the latitude of each crime location.

long is a numeric vector representing the longitude of each crime location.

street_id is a numeric vector representing the unique identifier for the street where each crime occurred.

street_name is a character vector representing the name of the location where each crime occurred. It has no NA values, although some entries read only “On or near” with no street name attached.

context is a logical vector intended to hold any additional context for each crime. Every value shown in the preview is NA.

id is a numeric vector representing the ID of each crime. It is likely to be a unique identifier.

location_type is a character vector representing the type of location where each crime was recorded (e.g., “Force” and “BTP”).

location_subtype is a character vector representing the subtype of location for each crime. The values shown in the preview are all NA.

outcome_status is a character vector representing the outcome status of each crime. It also contains missing values.



Data Preprocessing
# Check for missing values in the dataset
sum(is.na(colc_crime_data))
## [1] 15110

There are 15,110 missing values across the dataset's 82,536 cells (6,878 rows × 12 columns), which is about 18% of all cells.

# Calculate the number of missing values for each column
missing_val <- colSums(is.na(colc_crime_data))

# Filter columns with missing values
missing_col <- missing_val[missing_val > 0]

# Create a data frame with columns and their respective missing value counts
missing_col_with_val <- data.frame(Column = names(missing_col), Missing_Val = missing_col, row.names = NULL)

# Display the missing values with their columns
missing_col_with_val
##             Column Missing_Val
## 1    persistent_id         701
## 2          context        6878
## 3 location_subtype        6854
## 4   outcome_status         677

The results show that “persistent_id” has 701 missing values, “context” is missing in every row (6,878), “location_subtype” is missing in 6,854 rows, and “outcome_status” has 677 missing values. No other column contains missing values.
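Raw counts are easier to compare once they are expressed as a share of the rows. The sketch below shows one way to do that on a small toy data frame; `toy_crime` and its values are hypothetical stand-ins, not rows from crime23.csv.

```r
# Toy stand-in for colc_crime_data (illustrative values only)
toy_crime <- data.frame(
  category       = c("drugs", "robbery", "burglary", "drugs"),
  context        = c(NA, NA, NA, NA),
  outcome_status = c("Under investigation", NA, "Local resolution", NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colMeans() gives the fraction of NA per column
missing_share <- round(colMeans(is.na(toy_crime)) * 100, 1)

# Keep only columns with at least one missing value, largest share first
missing_summary <- sort(missing_share[missing_share > 0], decreasing = TRUE)
missing_summary
```

Applied to the counts reported above, the same arithmetic gives roughly 10.2% for persistent_id (701 / 6,878), 100% for context, 99.7% for location_subtype (6,854 / 6,878) and 9.8% for outcome_status (677 / 6,878).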

# Handle missing data

# Drop the "context" and "location_subtype" columns
colc_crime_data <- colc_crime_data[, !(names(colc_crime_data) %in% c("context", "location_subtype"))]

# Replace missing values in persistent_id and outcome_status with "Unknown"
colc_crime_data <- colc_crime_data %>%
  mutate(persistent_id = ifelse(is.na(persistent_id), "Unknown", persistent_id),
         outcome_status = ifelse(is.na(outcome_status), "Unknown", outcome_status))

# confirm the change
head(colc_crime_data)
## # A tibble: 6 × 10
##   category          persistent_id date    lat  long street_id street_name     id
##   <chr>             <chr>         <chr> <dbl> <dbl>     <dbl> <chr>        <dbl>
## 1 anti-social-beha… Unknown       2023…  51.9 0.909   2153366 On or near… 1.08e8
## 2 anti-social-beha… Unknown       2023…  51.9 0.902   2153173 On or near  1.08e8
## 3 anti-social-beha… Unknown       2023…  51.9 0.898   2153077 On or near… 1.08e8
## 4 anti-social-beha… Unknown       2023…  51.9 0.902   2153186 On or near… 1.08e8
## 5 anti-social-beha… Unknown       2023…  51.9 0.895   2153012 On or near… 1.08e8
## 6 anti-social-beha… Unknown       2023…  51.9 0.909   2153379 On or near… 1.08e8
## # ℹ 2 more variables: location_type <chr>, outcome_status <chr>

The “context” and “location_subtype” columns were dropped because they are empty or almost empty and therefore add no useful information to the analysis. Missing values in “persistent_id” and “outcome_status” were replaced with “Unknown”, which keeps all 6,878 rows available for analysis and visualisation. Note that “Unknown” is a placeholder rather than a real value, so it is counted as a level of its own in any later tabulation of these two columns.
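The same replacement can be written more compactly with `dplyr::coalesce()` applied across both columns. This is only a sketch of an alternative, not the approach used in the report; `toy_crime` and its values are hypothetical.

```r
library(dplyr)

# Toy stand-in with the two columns that contain NA (illustrative values only)
toy_crime <- data.frame(
  persistent_id  = c("abc123", NA, "def456"),
  outcome_status = c(NA, "Under investigation", NA),
  stringsAsFactors = FALSE
)

# coalesce() returns the first non-missing value, so NA becomes "Unknown"
toy_filled <- toy_crime %>%
  mutate(across(c(persistent_id, outcome_status), ~ coalesce(.x, "Unknown")))

toy_filled
```

Compared with `ifelse()`, `coalesce()` states the intent directly and checks that the replacement is type-compatible with the column.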

# Check for duplicates
duplicate_rows <- colc_crime_data[duplicated(colc_crime_data),]
print(duplicate_rows)
## # A tibble: 0 × 10
## # ℹ 10 variables: category <chr>, persistent_id <chr>, date <chr>, lat <dbl>,
## #   long <dbl>, street_id <dbl>, street_name <chr>, id <dbl>,
## #   location_type <chr>, outcome_status <chr>

The result above shows no duplicate rows in the dataset; hence, there is no need to remove duplicates.
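`duplicated()` on the whole data frame only flags rows that match in every column. A stricter check, sketched below on hypothetical toy data, is to look for repeated values in a column that is expected to be unique, such as `id`.

```r
# Toy stand-in: two rows share an id but differ in another column (illustrative values only)
toy_crime <- data.frame(
  id       = c(101, 102, 102, 103),
  category = c("drugs", "robbery", "robbery", "burglary"),
  outcome  = c("Unknown", "Unknown", "Under investigation", "Unknown"),
  stringsAsFactors = FALSE
)

# Full-row check: no row is identical to an earlier row
full_row_dupes <- sum(duplicated(toy_crime))

# Key-column check: counts ids that have already appeared
id_dupes <- sum(duplicated(toy_crime$id))

# Show every row that shares an id with another row
repeated_ids <- toy_crime[toy_crime$id %in% toy_crime$id[duplicated(toy_crime$id)], ]
repeated_ids
```

Whether `id` is truly unique in crime23.csv is not shown in the output above, so this check is a suggestion rather than a reported result.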

# Load library
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Convert date column to year-month format
colc_crime_data$date <- as.yearmon(colc_crime_data$date)

# Convert date column to Date format
colc_crime_data$date <- as.Date(colc_crime_data$date)

# Confirm the new date format
str(colc_crime_data)
## tibble [6,878 × 10] (S3: tbl_df/tbl/data.frame)
##  $ category      : chr [1:6878] "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ persistent_id : chr [1:6878] "Unknown" "Unknown" "Unknown" "Unknown" ...
##  $ date          : Date[1:6878], format: "2023-01-01" "2023-01-01" ...
##  $ lat           : num [1:6878] 51.9 51.9 51.9 51.9 51.9 ...
##  $ long          : num [1:6878] 0.909 0.902 0.898 0.902 0.895 ...
##  $ street_id     : num [1:6878] 2153366 2153173 2153077 2153186 2153012 ...
##  $ street_name   : chr [1:6878] "On or near Military Road" "On or near" "On or near Culver Street West" "On or near Ryegate Road" ...
##  $ id            : num [1:6878] 1.08e+08 1.08e+08 1.08e+08 1.08e+08 1.08e+08 ...
##  $ location_type : chr [1:6878] "Force" "Force" "Force" "Force" ...
##  $ outcome_status: chr [1:6878] "Unknown" "Unknown" "Unknown" "Unknown" ...

The “date” column holds year-month strings (YYYY-MM), which cannot be converted directly to a Date because they lack a day component. The ‘as.yearmon()’ function from the ‘zoo’ package was therefore used to convert them to a year-month format, and ‘as.Date()’ then converted the result to Date format, assigning the first day of each month.
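The same result can be reached in base R by appending a day before parsing. The sketch below uses a hypothetical vector `year_month`; it is an alternative to the zoo-based conversion above, not a replacement for it.

```r
# Year-month strings in the same YYYY-MM shape as the date column (illustrative values)
year_month <- c("2023-01", "2023-02", "2023-12")

# Append "-01" so each string is a complete date, then parse with an explicit format
crime_dates <- as.Date(paste0(year_month, "-01"), format = "%Y-%m-%d")
crime_dates

# The month can then be extracted for grouping (month names depend on the locale)
crime_month <- format(crime_dates, "%m")
crime_month
```

lubridate, which is already loaded in this report, also provides `ym()` for parsing year-month strings; it should give the same first-of-month dates.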

# Check unique values in categorical columns
unique_val <- sapply(colc_crime_data[, sapply(colc_crime_data, is.character)],
                     function(x) length(unique(x)))

# print unique values
unique_val
##       category  persistent_id    street_name  location_type outcome_status 
##             14           6177            351              2             14

The output summarises the character columns. The “category” column has 14 unique values, each representing a type of reported crime. The “persistent_id” column has 6,177 unique values; this count includes the “Unknown” placeholder that replaced the 701 missing identifiers. The “street_name” column contains 351 unique location names, and “location_type” contains 2 unique types. The “outcome_status” column has 14 distinct values, that is, 13 recorded outcomes plus the “Unknown” placeholder. The “date” column is not listed because it is now a Date rather than a character column; it covers the 12 months of 2023.
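Because “Unknown” is a placeholder, it can be useful to count distinct values with and without it. The sketch below illustrates the difference on a hypothetical vector `toy_ids`.

```r
# Toy identifiers where NA has already been replaced by "Unknown" (illustrative values)
toy_ids <- c("abc", "Unknown", "def", "Unknown", "abc")

# Counting every distinct value treats the placeholder as an identifier
n_with_placeholder <- length(unique(toy_ids))

# Dropping the placeholder first counts only real identifiers
n_real <- length(unique(toy_ids[toy_ids != "Unknown"]))

c(with_placeholder = n_with_placeholder, real = n_real)
```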



Exploratory Data Analysis and Visualisation

Two-way table
# create a two-way table for category and outcome status
two_way_table <- table(colc_crime_data$category, colc_crime_data$outcome_status)
two_way_table
##                        
##                         Action to be taken by another organisation
##   anti-social-behaviour                                          0
##   bicycle-theft                                                  0
##   burglary                                                       0
##   criminal-damage-arson                                          5
##   drugs                                                          2
##   other-crime                                                    1
##   other-theft                                                    1
##   possession-of-weapons                                          1
##   public-order                                                   6
##   robbery                                                        0
##   shoplifting                                                    5
##   theft-from-the-person                                          0
##   vehicle-crime                                                  0
##   violent-crime                                                 83
##                        
##                         Awaiting court outcome Court result unavailable
##   anti-social-behaviour                      0                        0
##   bicycle-theft                              0                        1
##   burglary                                   1                       15
##   criminal-damage-arson                     31                       22
##   drugs                                     18                       17
##   other-crime                                6                        3
##   other-theft                                2                        6
##   possession-of-weapons                      9                       10
##   public-order                              18                       17
##   robbery                                    4                        4
##   shoplifting                               51                       45
##   theft-from-the-person                      0                        0
##   vehicle-crime                              6                        2
##   violent-crime                            114                       64
##                        
##                         Formal action is not in the public interest
##   anti-social-behaviour                                           0
##   bicycle-theft                                                   0
##   burglary                                                        0
##   criminal-damage-arson                                           0
##   drugs                                                           1
##   other-crime                                                     0
##   other-theft                                                     1
##   possession-of-weapons                                           0
##   public-order                                                    2
##   robbery                                                         0
##   shoplifting                                                     1
##   theft-from-the-person                                           0
##   vehicle-crime                                                   0
##   violent-crime                                                   4
##                        
##                         Further action is not in the public interest
##   anti-social-behaviour                                            0
##   bicycle-theft                                                    0
##   burglary                                                         2
##   criminal-damage-arson                                            2
##   drugs                                                           10
##   other-crime                                                      6
##   other-theft                                                      1
##   possession-of-weapons                                            1
##   public-order                                                    12
##   robbery                                                          0
##   shoplifting                                                     11
##   theft-from-the-person                                            0
##   vehicle-crime                                                    0
##   violent-crime                                                   37
##                        
##                         Further investigation is not in the public interest
##   anti-social-behaviour                                                   0
##   bicycle-theft                                                           0
##   burglary                                                                0
##   criminal-damage-arson                                                   0
##   drugs                                                                   0
##   other-crime                                                             8
##   other-theft                                                             0
##   possession-of-weapons                                                   0
##   public-order                                                            0
##   robbery                                                                 0
##   shoplifting                                                             0
##   theft-from-the-person                                                   0
##   vehicle-crime                                                           0
##   violent-crime                                                           0
##                        
##                         Investigation complete; no suspect identified
##   anti-social-behaviour                                             0
##   bicycle-theft                                                   216
##   burglary                                                        154
##   criminal-damage-arson                                           363
##   drugs                                                            15
##   other-crime                                                      13
##   other-theft                                                     350
##   possession-of-weapons                                             9
##   public-order                                                    193
##   robbery                                                          38
##   shoplifting                                                     299
##   theft-from-the-person                                            61
##   vehicle-crime                                                   350
##   violent-crime                                                   595
##                        
##                         Local resolution Offender given a caution
##   anti-social-behaviour                0                        0
##   bicycle-theft                        1                        0
##   burglary                             0                        0
##   criminal-damage-arson               14                        4
##   drugs                               98                        6
##   other-crime                          0                        2
##   other-theft                          1                        2
##   possession-of-weapons               11                        6
##   public-order                         7                        1
##   robbery                              1                        0
##   shoplifting                         34                        5
##   theft-from-the-person                0                        0
##   vehicle-crime                        1                        0
##   violent-crime                       71                       35
##                        
##                         Status update unavailable
##   anti-social-behaviour                         0
##   bicycle-theft                                 8
##   burglary                                     13
##   criminal-damage-arson                         6
##   drugs                                         9
##   other-crime                                  10
##   other-theft                                  12
##   possession-of-weapons                         5
##   public-order                                 14
##   robbery                                       5
##   shoplifting                                   4
##   theft-from-the-person                         2
##   vehicle-crime                                 5
##   violent-crime                                84
##                        
##                         Suspect charged as part of another case
##   anti-social-behaviour                                       0
##   bicycle-theft                                               0
##   burglary                                                    0
##   criminal-damage-arson                                       0
##   drugs                                                       0
##   other-crime                                                 0
##   other-theft                                                 0
##   possession-of-weapons                                       0
##   public-order                                                0
##   robbery                                                     0
##   shoplifting                                                 1
##   theft-from-the-person                                       0
##   vehicle-crime                                               0
##   violent-crime                                               0
##                        
##                         Unable to prosecute suspect Under investigation Unknown
##   anti-social-behaviour                           0                   0     677
##   bicycle-theft                                   8                   1       0
##   burglary                                       20                  20       0
##   criminal-damage-arson                         117                  17       0
##   drugs                                          13                  19       0
##   other-crime                                    34                   9       0
##   other-theft                                    94                  21       0
##   possession-of-weapons                          12                  10       0
##   public-order                                  218                  44       0
##   robbery                                        35                   7       0
##   shoplifting                                    76                  22       0
##   theft-from-the-person                          10                   3       0
##   vehicle-crime                                  23                  19       0
##   violent-crime                                1299                 247       0

The two-way table above represents a cross-tabulation of the counts of crime incidents based on their category and outcome status.

It shows how many incidents fall into each outcome status for each crime category. For example, under “Investigation complete; no suspect identified”, the counts are high for bicycle-theft (216), burglary (154), criminal-damage-arson (363), other-theft (350), shoplifting (299), vehicle-crime (350) and violent-crime (595). The “Unknown” outcome status applies only to anti-social-behaviour: all 677 incidents in that category have no recorded outcome, and no other category has any “Unknown” outcomes.

The table suggests that outcomes differ by crime type. For bicycle-theft, 216 of 235 incidents ended with no suspect identified, and for shoplifting the figure is 299 of 554, suggesting that identifying suspects is difficult for these crimes. violent-crime has high counts across most outcome statuses, which largely reflects that it is the biggest category; its most common outcome is “Unable to prosecute suspect” (1,299 incidents).

Understanding the distribution of outcome statuses for different crime categories can help law enforcement agencies allocate resources effectively and prioritise investigations based on the likelihood of resolution. It can also inform decision-making and resource allocation strategies for crime prevention and law enforcement efforts.
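Raw counts in a two-way table are dominated by the largest categories, so row percentages make categories easier to compare. The sketch below applies `prop.table()` with `margin = 1` to a small hypothetical data frame; the category and outcome labels are shortened and the values are illustrative.

```r
# Toy stand-in for the category and outcome_status columns (illustrative values only)
toy_crime <- data.frame(
  category = c(rep("bicycle-theft", 4), rep("drugs", 4)),
  outcome_status = c(
    "No suspect identified", "No suspect identified",
    "No suspect identified", "Under investigation",
    "Local resolution", "Local resolution",
    "No suspect identified", "Under investigation"
  ),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert each row to percentages (margin = 1 means by row)
counts  <- table(toy_crime$category, toy_crime$outcome_status)
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)
row_pct
```

Using the counts in the table above, this kind of calculation shows that about 92% of bicycle thefts (216 of 235) ended with no suspect identified, compared with about 7% of drug offences (15 of 208).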

Pie chart
# Explore the distribution of crime by category using pie chart

# Load libraries
library(ggplot2)
library(plotly)

# Group by category and calculate frequencies
crime_by_category <- colc_crime_data %>%
  group_by(category) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency))

# Calculate percentage
crime_by_category$percentage <- crime_by_category$frequency / sum(crime_by_category$frequency) * 100

print(crime_by_category)
## # A tibble: 14 × 3
##    category              frequency percentage
##    <chr>                     <int>      <dbl>
##  1 violent-crime              2633      38.3 
##  2 anti-social-behaviour       677       9.84
##  3 criminal-damage-arson       581       8.45
##  4 shoplifting                 554       8.05
##  5 public-order                532       7.73
##  6 other-theft                 491       7.14
##  7 vehicle-crime               406       5.90
##  8 bicycle-theft               235       3.42
##  9 burglary                    225       3.27
## 10 drugs                       208       3.02
## 11 robbery                      94       1.37
## 12 other-crime                  92       1.34
## 13 theft-from-the-person        76       1.10
## 14 possession-of-weapons        74       1.08
# Create a pie chart
pie_chart <- plot_ly(crime_by_category, labels = ~category, values = ~frequency, type = 'pie', 
                     textinfo = 'label+percent', insidetextfont = list(color = '#FFFFFF', size = 10)) %>%
                     layout(title = "Distribution of Crimes by Category")

# Show the pie chart
pie_chart

The interactive pie chart shows the distribution of crime by category in Colchester in 2023. Hovering over a slice shows the frequency and percentage for that category.

Violent crime (38.28%) is the most prevalent category in the dataset, with 2,633 reported incidents. This high frequency suggests a significant concern for public safety and underscores the need for interventions to address violence within the community. Anti-social behaviour (9.84%) is the second-largest category but accounts for a much smaller proportion of incidents than violent crime. With 677 reported cases, it remains a notable issue that may contribute to community disruption and discomfort. The 581 reported incidents of criminal damage and arson (8.45%) highlight concerns about property-related offences and deliberate acts of vandalism. Addressing these incidents is essential for preserving public and private property.

With 554 reported cases, shoplifting (8.05%) constitutes a sizeable share of reported crimes, indicating theft from commercial establishments. This may have economic implications for businesses and consumers alike. Public-order offences (7.73%), with 532 reported incidents, point to challenges in maintaining order and preventing disturbances within the community. Addressing public order issues is important for keeping the environment safe and peaceful for residents. The other-theft category (7.14%) covers thefts not classified elsewhere, with 491 reported incidents. The diversity of theft-related offences underscores the need for comprehensive strategies to combat various forms of theft.

The 406 reported incidents of vehicle crime (5.90%) suggest concerns about theft of or from vehicles and vehicle vandalism. Protecting vehicles and preventing such offences matters for vehicle owners and the community at large. Although a smaller proportion of reported crimes, the 235 incidents of bicycle theft (3.42%) show that bicycles are a target; addressing bicycle theft can help promote cycling as a sustainable mode of transport. With 225 reported incidents, burglary (3.27%) involves unlawful entry into buildings with intent to commit theft or other crimes. Preventing burglaries is crucial for safeguarding residential and commercial properties. The 208 reported drug-related offences (3.02%) highlight concerns about the possession, distribution, or trafficking of controlled substances. Addressing drug-related activity is important for combating substance abuse and associated criminal behaviour.

With 94 reported incidents, robbery (1.37%) involves theft or attempted theft using force or the threat of force against individuals. Although a relatively small proportion of reported crimes, robberies are concerning because of their potentially violent nature. The other-crime category (1.34%), with 92 reported incidents, covers offences that do not fall into any of the other categories and reflects the diversity of criminal activity within the community. With 76 reported incidents, theft-from-the-person (1.10%) involves direct theft from individuals, such as pickpocketing or purse snatching. Although a small share of reported crimes, it underscores individuals’ vulnerability to targeted theft. Finally, the 74 reported incidents of possession of weapons (1.08%) signify concerns about the unlawful possession of weapons within the community. Addressing weapons-related offences is important for maintaining public safety and preventing potential harm.

Bar Plot
# Visualise the distribution of crime resolution status (outcome_status)

# Count the number of occurrences for each outcome status
outcome_counts <- colc_crime_data %>%
  count(outcome_status, sort = TRUE)

outcome_counts
## # A tibble: 14 × 2
##    outcome_status                                          n
##    <chr>                                               <int>
##  1 Investigation complete; no suspect identified        2656
##  2 Unable to prosecute suspect                          1959
##  3 Unknown                                               677
##  4 Under investigation                                   439
##  5 Awaiting court outcome                                260
##  6 Local resolution                                      239
##  7 Court result unavailable                              206
##  8 Status update unavailable                             177
##  9 Action to be taken by another organisation            104
## 10 Further action is not in the public interest           82
## 11 Offender given a caution                               61
## 12 Formal action is not in the public interest             9
## 13 Further investigation is not in the public interest     8
## 14 Suspect charged as part of another case                 1
# Define custom colors
custom_colors <- c("chartreuse4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "dodgerblue4", "deeppink3", "slategray3", "#bcbd22", "#17becf", "green2", "coral1", "cadetblue1", "royalblue1")

# Plotting the distribution of outcome status with custom colors
outcome_plot <- ggplot(outcome_counts, aes(x = reorder(outcome_status, n), y = n, fill = outcome_status, text = paste("Outcome Status:", outcome_status, "<br>Number of Incidents:", n))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = custom_colors[1:length(unique(colc_crime_data$outcome_status))]) + 
  labs(title = "Distribution of Crime Outcome Status in Colchester (2023)",
       x = "Outcome Status",
       y = "Number of Incidents") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),  # hjust = 1 right-aligns the rotated labels under their bars
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

# Convert ggplot to plotly object and enable hover
outcome_plotly <- ggplotly(outcome_plot, tooltip = c("text"))

# Display the plot
outcome_plotly

The interactive bar plot above shows how reported crimes in Colchester were resolved in 2023. As you hover over each bar, the outcome status and its incident count are shown.

The largest group of incidents (2656, roughly 39% of the 6878 recorded) had investigations completed with no suspect identified. The next largest group (1959) comprised incidents where a suspect could not be prosecuted. Notably, there were 677 cases with unknown outcomes, indicating gaps in data availability. Additionally, ongoing investigations accounted for 439 incidents, while 260 awaited court outcomes. A further 239 incidents were resolved locally, without formal legal proceedings. Information on court results and status updates was unavailable for 206 and 177 incidents, respectively.

Furthermore, 104 cases required action from other organisations, suggesting some collaboration between agencies. In 82 instances, further action was deemed not to be in the public interest, and 61 offenders were given a caution. Formal action was judged not to be in the public interest in 9 cases, and further investigation in 8. Lastly, only one incident involved a suspect being charged as part of another case, hinting at potentially interrelated criminal activity.
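The counts discussed above are easier to compare as shares of the total. The following is a minimal sketch in base R that rebuilds the printed table by hand (the counts are copied from the tibble output in this section) and adds percentage and cumulative-percentage columns. The names `outcome_tbl`, `share_pct` and `cum_pct` are illustrative and do not appear in the original analysis.

```r
# Counts copied from the outcome_counts tibble printed in this section
outcome_tbl <- data.frame(
  outcome_status = c(
    "Investigation complete; no suspect identified",
    "Unable to prosecute suspect",
    "Unknown",
    "Under investigation",
    "Awaiting court outcome",
    "Local resolution",
    "Court result unavailable",
    "Status update unavailable",
    "Action to be taken by another organisation",
    "Further action is not in the public interest",
    "Offender given a caution",
    "Formal action is not in the public interest",
    "Further investigation is not in the public interest",
    "Suspect charged as part of another case"
  ),
  n = c(2656, 1959, 677, 439, 260, 239, 206, 177, 104, 82, 61, 9, 8, 1),
  stringsAsFactors = FALSE
)

# Total number of incidents across all outcome statuses
total_incidents <- sum(outcome_tbl$n)

# Share of each status, and the running share from the most to the least common
outcome_tbl$share_pct <- round(100 * outcome_tbl$n / total_incidents, 1)
outcome_tbl$cum_pct   <- round(100 * cumsum(outcome_tbl$n) / total_incidents, 1)

print(outcome_tbl)
```

With the real data, the same two columns could be added directly to `outcome_counts` instead of retyping the figures.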

Time Series Map
# Crime Time Series Map
library(plotly)

# Convert date column to character format so each date can serve as an animation frame
colc_crime_data$date <- as.character(colc_crime_data$date)

# Create a time series map with advanced layers
colc_crime_map <- plot_ly(data = colc_crime_data, type = "scattermapbox", mode = "markers") %>%
  add_trace(lat = ~lat, lon = ~long, color = ~category, colors = "Set1", size = 5,
            text = ~paste("Category: ", category, "<br>Date: ", date),
            hoverinfo = "text",
            frame = ~date) %>%
  layout(title = "Crime Time Series Map for Colchester (2023)",
         mapbox = list(style = "carto-positron",
                       zoom = 10,
                       center = list(lon = mean(colc_crime_data$long), lat = mean(colc_crime_data$lat))),
         xaxis = list(title = "Longitude"),
         yaxis = list(title = "Latitude"),
         legend = list(title = list(text = "Category")),
         updatemenus = list(list(
           buttons = list(
             list(
               args = list(frame = list(duration = 1000, redraw = TRUE), 
                           fromcurrent = TRUE),
               label = "Play",
               method = "animate"
             ),
             list(
               args = list(frame = list(duration = 0, redraw = TRUE), 
                           mode = "immediate"),
               label = "Pause",
               method = "animate"
             )
           ),
           direction = "left",
           pad = list(r = 10, t = 87),
           showactive = FALSE,
           type = "buttons",
           x = 0.1,
           xanchor = "right",
           y = 0,
           yanchor = "top"
         ))
  )

# Display the time series map
colc_crime_map
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

The plot above is an interactive map of the crime time series for Colchester in 2023. It is plotted to explore crime data over time. The map uses a carto-positron style, a light-coloured base map emphasising streets and labels. Zoom level and centre are set to show the entirety of Colchester. Each marker represents a crime incident, and the colour of the marker corresponds to the crime category (legend on the right).

When you hover over a marker, two pieces of information are shown: the category, which gives the type of crime for that incident (e.g., anti-social-behaviour, theft-from-the-person), and the date on which the crime occurred. A time slider with Play and Pause buttons in the bottom left corner allows you to animate the map over time. After clicking “Play”, the markers appear sequentially, representing the crime incidents throughout 2023. This animation helps visualise temporal patterns in crime occurrences.

Moving through the months with the slider, the density of markers appears to increase towards the end of the year, suggesting that the period around the festive season records more crime than other months. One possible explanation is the increased social and commercial activity at that time of year, although the map alone cannot confirm this. Law enforcement may wish to give this period particular attention, and further research could examine which other factors are responsible.
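A visual impression from an animated map is worth checking against a simple monthly tally. The sketch below shows one way to do that in base R. It runs on a small made-up data frame (`toy_crime`, purely hypothetical and not taken from crime23.csv) and assumes the date column can be parsed by `as.Date()`, as it is later in this report.

```r
# Hypothetical toy data: one row per incident (NOT the real Colchester data)
toy_crime <- data.frame(
  category = c("burglary", "robbery", "drugs", "burglary", "robbery", "drugs"),
  date     = c("2023-01-01", "2023-01-01", "2023-06-01",
               "2023-12-01", "2023-12-01", "2023-12-01"),
  stringsAsFactors = FALSE
)

# Derive a year-month label for each incident
toy_crime$month <- format(as.Date(toy_crime$date), "%Y-%m")

# Count incidents per month; table() sorts the month labels alphabetically,
# which is also chronological order for the "YYYY-MM" format
monthly_counts <- as.data.frame(table(month = toy_crime$month),
                                responseName = "n_incidents")

print(monthly_counts)
```

Replacing `toy_crime` with `colc_crime_data` would give the actual monthly totals, which can then be compared directly rather than judged by eye from the animation.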

names(colc_crime_data)
##  [1] "category"       "persistent_id"  "date"           "lat"           
##  [5] "long"           "street_id"      "street_name"    "id"            
##  [9] "location_type"  "outcome_status"
Bar Chart
library(dplyr)
library(ggplot2) 
# Get the top ten streets 
# Group the data by street name and summarize the total crime count for each street
street_crime_counts <- colc_crime_data %>%
  group_by(street_name) %>%
  summarise(total_crime_count = n()) %>%
  ungroup()

# Arrange the data in descending order of crime count and select the top 10 streets
top_ten_streets <- street_crime_counts %>%
  arrange(desc(total_crime_count)) %>%
  head(10)

# Plot an interactive bar chart of the 10 streets with the highest crime counts

# Reuse the top 10 selection made above under the name used by the plotting code
top_crime_streets <- top_ten_streets

# Create the histogram
histogram <- plot_ly(top_crime_streets, x = ~street_name, y = ~total_crime_count, type = "bar",
                     marker = list(color = ~total_crime_count, 
                                   colorscale = "Viridis",
                                   line = list(color = "black")),
                     hoverinfo = "text",
                     text = ~paste("Street: ", street_name, "<br>Total Crime Count: ", total_crime_count)) %>%
  layout(title = "Top 10 Most Dangerous Streets in Colchester (2023)",
         xaxis = list(title = "Street Name"),
         yaxis = list(title = "Total Crime Count"),
         hovermode = "closest",
         showlegend = FALSE)

# Display the interactive histogram
histogram

The interactive plot, which is strictly a bar chart rather than a histogram because it shows counts per named location, identifies the 10 locations with the highest total crime counts. Hovering over a bar shows the location name and its total crime count. The two highest counts belong to “On or near” (495), which appears to be a generic label with no street name attached, and “On or near Shopping Area” (328). This suggests a potential concentration of crime around shopping areas, possibly due to factors like increased opportunity for theft or vandalism.

Other generic location types with high crime counts include “Supermarket” (243), “Parking Area” (171), and “Nightclub” (142), which might attract criminal activity for similar reasons. The named streets in the top 10 are “Cowdray Avenue” (164), “St Nicholas Street” (150), “Balkerne Gardens” (148), “Church Street” (144), and “George Street” (117), indicating potential crime hotspots in residential or public areas.
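Because generic labels and named streets are mixed in the ranking, it can help to separate the two before interpreting hotspots. The following is a rough sketch in base R. The exact strings stored in `street_name` are an assumption (the text above quotes them in shortened form), and the `generic_labels` vector is a hand-picked, hypothetical list that would need checking against the real data.

```r
# Toy counts using figures quoted in the text; the exact label strings are assumed
street_counts <- data.frame(
  street_name = c("On or near", "On or near Shopping Area", "On or near Supermarket",
                  "On or near Parking Area", "On or near Cowdray Avenue",
                  "On or near St Nicholas Street"),
  total_crime_count = c(495, 328, 243, 171, 164, 150),
  stringsAsFactors = FALSE
)

# Hypothetical list of place types that are not actual street names
generic_labels <- c("", "Shopping Area", "Supermarket", "Parking Area", "Nightclub")

# Strip the "On or near" prefix and flag rows whose remainder is a generic label
place <- trimws(sub("^On or near", "", street_counts$street_name))
street_counts$is_generic <- place %in% generic_labels

# Keep only rows that refer to a named street
named_streets <- street_counts[!street_counts$is_generic, ]

print(named_streets)
```

Ranking `named_streets` on its own gives a list of actual streets, while the generic rows can be reported separately as location types.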

Scatter Plot
# Create a new column for converted date
colc_crime_data <- colc_crime_data %>%
  mutate(converted_date = as.Date(date))

# Aggregate data by year-month 
colc_crime_data_agg <- colc_crime_data %>% 
  mutate(year_month = format(converted_date, "%Y-%m")) %>% 
  group_by(year_month, lat, long) %>% 
  summarise(crime_count = n(), .groups = "drop") 

# Create a ggplot object
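crime_map_gg <- ggplot(colc_crime_data_agg,
                       aes(x = long, y = lat, size = crime_count, color = crime_count,
                           # 'text' feeds the ggplotly tooltip requested below
                           text = paste("Month:", year_month, "<br>Crime Count:", crime_count))) +
  geom_point(shape = 21, fill = "black") +  # Filled circles: black fill, outline coloured by crime count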
  scale_size_continuous(range = c(1, 5)) +   # Adjust size range
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Crime Incidence over Crime Count in Colchester (2023)", 
       x = "Longitude", 
       y = "Latitude", 
       size = "Crime Count",
       color = "Crime Count") +
  theme_minimal()

# Convert ggplot object to plotly object
crime_map_plotly <- ggplotly(crime_map_gg, tooltip = "text")

# Display the interactive plot
crime_map_plotly

The interactive scatter plot visualises crime counts by location, using the latitude and longitude from the dataset. Each circle represents one location in one month, because the data are aggregated by year-month and coordinates, and both the size and the colour of the circle correspond to the number of crimes recorded there. There seems to be a higher concentration of crimes in the central and southern parts of Colchester, which could be due to factors such as population density, commercial activity, or the presence of certain types of establishments. Note that the colour scale encodes crime count, not severity: red circles mark location-months with more recorded crimes and blue circles mark those with fewer.

Leaflet Map
library(leaflet)

# Map each crime category to a colour (leaflet expects colour values, not category names)
category_pal <- colorFactor(palette = "viridis", domain = colc_crime_data$category)

# Create a leaflet map
crime_map_leaflet <- leaflet(data = colc_crime_data) %>%
  
  # Add tile layers for the base map
  addTiles() %>%
  
  # Add crime data as circle markers with different colors for each category
  addCircleMarkers(
    radius = 3,   # Set a fixed radius for the circles
    color = ~category_pal(category), # Color based on category
    stroke = FALSE, # No border
    fillOpacity = 0.6, # Opacity of the fill
    popup = ~paste("Category:", category, "<br>Date:", date),  # Popup information
    label = ~paste("Category:", category)   # Label information
  ) %>%
  
  # Add scale bar
  addScaleBar(position = "bottomright") %>%
  
  # Set map options
  setView(lng = mean(colc_crime_data$long), lat = mean(colc_crime_data$lat), zoom = 10) # Set the initial view
## Assuming "long" and "lat" are longitude and latitude, respectively
# Display the map
crime_map_leaflet

Hovering over a point shows its crime category, and clicking it opens a popup with the category and date. Different crime categories can be seen across the locations.

Climate Data Analysis

# read the colchester 2023 climate dataset
colc_clm23 <- read.csv("temp2023.csv")

# check the first few rows
head(colc_clm23)
##   station_ID       Date TemperatureCAvg TemperatureCMax TemperatureCMin TdAvgC
## 1       3590 2023-12-31             8.7            10.6             4.4    7.2
## 2       3590 2023-12-30             6.6             9.7             4.4    4.2
## 3       3590 2023-12-29             9.9            11.4             6.9    6.0
## 4       3590 2023-12-28             9.9            11.5             4.0    7.5
## 5       3590 2023-12-27             5.8            10.6             3.9    3.7
## 6       3590 2023-12-26             9.8            12.7             6.3    7.6
##   HrAvg WindkmhDir WindkmhInt WindkmhGust PresslevHp Precmm TotClOct lowClOct
## 1  89.6          S       25.0        63.0      999.0    6.2      8.0      8.0
## 2  85.5        WSW       22.7        50.0     1006.9    0.4      4.6      6.5
## 3  77.2         SW       32.8        61.2     1003.6    0.8      6.5      6.7
## 4  84.6        SSW       32.2        70.4     1003.2    2.8      6.8      7.1
## 5  86.4         SW       13.2        37.1     1016.4    2.0      4.0      6.9
## 6  86.9        WSW       23.5        46.3     1006.2    4.4      6.5      7.4
##   SunD1h VisKm PreselevHp SnowDepcm
## 1    0.0  26.3         NA        NA
## 2    1.1  48.3         NA        NA
## 3    0.1  26.7         NA        NA
## 4    0.0  25.1         NA        NA
## 5    3.2  30.1         NA        NA
## 6    0.0  45.8         NA        NA
# show the names of the dataset column
names(colc_clm23)
##  [1] "station_ID"      "Date"            "TemperatureCAvg" "TemperatureCMax"
##  [5] "TemperatureCMin" "TdAvgC"          "HrAvg"           "WindkmhDir"     
##  [9] "WindkmhInt"      "WindkmhGust"     "PresslevHp"      "Precmm"         
## [13] "TotClOct"        "lowClOct"        "SunD1h"          "VisKm"          
## [17] "PreselevHp"      "SnowDepcm"
# Check the structure of the data
str(colc_clm23)
## 'data.frame':    365 obs. of  18 variables:
##  $ station_ID     : int  3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
##  $ Date           : chr  "2023-12-31" "2023-12-30" "2023-12-29" "2023-12-28" ...
##  $ TemperatureCAvg: num  8.7 6.6 9.9 9.9 5.8 9.8 12.5 10 9.6 10 ...
##  $ TemperatureCMax: num  10.6 9.7 11.4 11.5 10.6 12.7 14.3 12 10.8 12.6 ...
##  $ TemperatureCMin: num  4.4 4.4 6.9 4 3.9 6.3 9.5 8.4 8.1 8.1 ...
##  $ TdAvgC         : num  7.2 4.2 6 7.5 3.7 7.6 10.1 7 6.5 6.2 ...
##  $ HrAvg          : num  89.6 85.5 77.2 84.6 86.4 86.9 85.3 81.5 81.2 78.2 ...
##  $ WindkmhDir     : chr  "S" "WSW" "SW" "SSW" ...
##  $ WindkmhInt     : num  25 22.7 32.8 32.2 13.2 23.5 34.1 32.7 34.1 37.5 ...
##  $ WindkmhGust    : num  63 50 61.2 70.4 37.1 46.3 72.3 61.2 68.6 77.8 ...
##  $ PresslevHp     : num  999 1007 1004 1003 1016 ...
##  $ Precmm         : num  6.2 0.4 0.8 2.8 2 4.4 0.8 0.8 0 2 ...
##  $ TotClOct       : num  8 4.6 6.5 6.8 4 6.5 7.8 5 8 7.5 ...
##  $ lowClOct       : num  8 6.5 6.7 7.1 6.9 7.4 7.8 6.7 8 7.5 ...
##  $ SunD1h         : num  0 1.1 0.1 0 3.2 0 0 2.9 0 1.4 ...
##  $ VisKm          : num  26.3 48.3 26.7 25.1 30.1 45.8 61.8 72.9 69.4 34.3 ...
##  $ PreselevHp     : logi  NA NA NA NA NA NA ...
##  $ SnowDepcm      : int  NA NA NA NA NA NA NA NA NA NA ...

The climate dataset has 365 observations (one per day of 2023) and 18 variables.

# Convert Date column to correct Date format
colc_clm23$Date <- as.Date(colc_clm23$Date)

# Confirm that the date column is now formatted appropriately
head(colc_clm23$Date)
## [1] "2023-12-31" "2023-12-30" "2023-12-29" "2023-12-28" "2023-12-27"
## [6] "2023-12-26"

The Date column has now been converted from character to the Date format.

summary(colc_clm23)
##    station_ID        Date            TemperatureCAvg TemperatureCMax
##  Min.   :3590   Min.   :2023-01-01   Min.   :-2.60   Min.   : 1.70  
##  1st Qu.:3590   1st Qu.:2023-04-02   1st Qu.: 7.20   1st Qu.:10.60  
##  Median :3590   Median :2023-07-02   Median :10.40   Median :14.20  
##  Mean   :3590   Mean   :2023-07-02   Mean   :10.92   Mean   :15.13  
##  3rd Qu.:3590   3rd Qu.:2023-10-01   3rd Qu.:15.80   3rd Qu.:20.00  
##  Max.   :3590   Max.   :2023-12-31   Max.   :23.10   Max.   :30.40  
##                                                                     
##  TemperatureCMin      TdAvgC           HrAvg        WindkmhDir       
##  Min.   :-6.200   Min.   :-4.400   Min.   :43.10   Length:365        
##  1st Qu.: 3.200   1st Qu.: 4.400   1st Qu.:75.60   Class :character  
##  Median : 6.300   Median : 7.600   Median :81.70   Mode  :character  
##  Mean   : 6.365   Mean   : 7.578   Mean   :81.25                     
##  3rd Qu.:10.600   3rd Qu.:11.200   3rd Qu.:87.90                     
##  Max.   :16.300   Max.   :17.500   Max.   :97.90                     
##                                                                      
##    WindkmhInt     WindkmhGust      PresslevHp         Precmm      
##  Min.   : 6.20   Min.   :13.00   Min.   : 967.4   Min.   : 0.000  
##  1st Qu.:12.00   1st Qu.:31.50   1st Qu.:1006.3   1st Qu.: 0.000  
##  Median :16.10   Median :38.90   Median :1014.3   Median : 0.000  
##  Mean   :16.81   Mean   :40.87   Mean   :1013.6   Mean   : 1.866  
##  3rd Qu.:20.20   3rd Qu.:48.20   3rd Qu.:1021.7   3rd Qu.: 1.150  
##  Max.   :37.50   Max.   :98.20   Max.   :1045.1   Max.   :33.600  
##                                                   NA's   :27      
##     TotClOct        lowClOct         SunD1h           VisKm      
##  Min.   :0.000   Min.   :1.800   Min.   : 0.000   Min.   : 3.60  
##  1st Qu.:3.600   1st Qu.:5.800   1st Qu.: 1.150   1st Qu.:22.70  
##  Median :5.100   Median :6.700   Median : 4.700   Median :31.50  
##  Mean   :4.988   Mean   :6.443   Mean   : 5.127   Mean   :32.11  
##  3rd Qu.:7.000   3rd Qu.:7.400   3rd Qu.: 8.050   3rd Qu.:41.50  
##  Max.   :8.000   Max.   :8.000   Max.   :15.400   Max.   :72.90  
##                  NA's   :13      NA's   :82                      
##  PreselevHp       SnowDepcm  
##  Mode:logical   Min.   :1    
##  NA's:365       1st Qu.:1    
##                 Median :1    
##                 Mean   :1    
##                 3rd Qu.:1    
##                 Max.   :1    
##                 NA's   :364

The summary output of the climate dataset (colc_clm23) provides an overview of the data recorded at station ID 3590 throughout 2023. It covers temperature, humidity, wind speed and direction, pressure, precipitation, cloudiness, sunshine duration, visibility, and snow depth. The daily average temperature ranged from -2.60°C to 23.10°C with a median of 10.40°C, while the medians of the daily maximum and minimum temperatures were 14.20°C and 6.30°C. Humidity varied from 43.10% to 97.90%, with a median of 81.70%. Wind speed ranged from 6.20 km/h to 37.50 km/h, with gusts reaching up to 98.20 km/h. Sea-level pressure ranged from 967.4 hPa to 1045.1 hPa. Daily precipitation had a mean of 1.866 mm and a maximum of 33.600 mm, with 27 missing values. Total cloudiness varied from 0.000 to 8.000 oktas, and low-level cloudiness has 13 missing values. Sunshine duration ranged from 0.000 to 15.400 hours, with a mean of 5.127 hours and 82 missing values. Visibility ranged from 3.60 km to 72.90 km. Snow depth was recorded on only one day (1 cm) and is missing for the other 364 days, and PreselevHp is missing for all 365 days.

# Check missing values
clm_missing_val <- colSums(is.na(colc_clm23))
clm_missing_val
##      station_ID            Date TemperatureCAvg TemperatureCMax TemperatureCMin 
##               0               0               0               0               0 
##          TdAvgC           HrAvg      WindkmhDir      WindkmhInt     WindkmhGust 
##               0               0               0               0               0 
##      PresslevHp          Precmm        TotClOct        lowClOct          SunD1h 
##               0              27               0              13              82 
##           VisKm      PreselevHp       SnowDepcm 
##               0             365             364

The result above reveals missing values in several columns of the climate dataset: Precmm has 27 missing values, lowClOct has 13, SunD1h has 82, PreselevHp has 365, and SnowDepcm has 364.

# Handling missing data

# Mean imputation for Precmm, lowClOct and SunD1h
colc_clm23$Precmm[is.na(colc_clm23$Precmm)] <- mean(colc_clm23$Precmm, na.rm = TRUE)
colc_clm23$lowClOct[is.na(colc_clm23$lowClOct)] <- mean(colc_clm23$lowClOct, na.rm = TRUE)
colc_clm23$SunD1h[is.na(colc_clm23$SunD1h)] <- mean(colc_clm23$SunD1h, na.rm = TRUE)


# Exclude PreselevHp and SnowDepcm from the dataset
colc_clm23 <- colc_clm23[, !(names(colc_clm23) %in% c("PreselevHp", "SnowDepcm"))]

# confirm missing data is handled
str(colc_clm23)
## 'data.frame':    365 obs. of  16 variables:
##  $ station_ID     : int  3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
##  $ Date           : Date, format: "2023-12-31" "2023-12-30" ...
##  $ TemperatureCAvg: num  8.7 6.6 9.9 9.9 5.8 9.8 12.5 10 9.6 10 ...
##  $ TemperatureCMax: num  10.6 9.7 11.4 11.5 10.6 12.7 14.3 12 10.8 12.6 ...
##  $ TemperatureCMin: num  4.4 4.4 6.9 4 3.9 6.3 9.5 8.4 8.1 8.1 ...
##  $ TdAvgC         : num  7.2 4.2 6 7.5 3.7 7.6 10.1 7 6.5 6.2 ...
##  $ HrAvg          : num  89.6 85.5 77.2 84.6 86.4 86.9 85.3 81.5 81.2 78.2 ...
##  $ WindkmhDir     : chr  "S" "WSW" "SW" "SSW" ...
##  $ WindkmhInt     : num  25 22.7 32.8 32.2 13.2 23.5 34.1 32.7 34.1 37.5 ...
##  $ WindkmhGust    : num  63 50 61.2 70.4 37.1 46.3 72.3 61.2 68.6 77.8 ...
##  $ PresslevHp     : num  999 1007 1004 1003 1016 ...
##  $ Precmm         : num  6.2 0.4 0.8 2.8 2 4.4 0.8 0.8 0 2 ...
##  $ TotClOct       : num  8 4.6 6.5 6.8 4 6.5 7.8 5 8 7.5 ...
##  $ lowClOct       : num  8 6.5 6.7 7.1 6.9 7.4 7.8 6.7 8 7.5 ...
##  $ SunD1h         : num  0 1.1 0.1 0 3.2 0 0 2.9 0 1.4 ...
##  $ VisKm          : num  26.3 48.3 26.7 25.1 30.1 45.8 61.8 72.9 69.4 34.3 ...

Precmm, lowClOct and SunD1h had 27, 13 and 82 missing values, respectively, so mean imputation was applied to replace them. This is a common and simple approach to handling missing data, and it assumes that the values are missing completely at random or missing at random. PreselevHp and SnowDepcm (365 and 364 missing values out of 365 observations, respectively) were dropped from the dataset because all or almost all of their values were missing. The output confirms that the two columns have been dropped, leaving 16 variables, and the imputed columns no longer contain missing values. Hence, further analysis can now be carried out.
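One way to confirm that the imputation worked is to count the remaining missing values per column. The sketch below uses a small made-up data frame (`toy_clm`, not the coursework data) with the same column names and a hypothetical helper, `impute_mean()`. It illustrates the approach and is not the exact code used in this report.

```r
# Toy data frame standing in for colc_clm23 (values are made up)
toy_clm <- data.frame(
  Precmm   = c(0.0, NA, 2.4, 1.0, NA),
  lowClOct = c(8, 6, NA, 7, 5),
  SunD1h   = c(NA, 1.5, 0.0, 3.5, 2.0)
)

# Hypothetical helper: replace NA values in a numeric vector with its mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

cols <- c("Precmm", "lowClOct", "SunD1h")
toy_clm[cols] <- lapply(toy_clm[cols], impute_mean)

# Count the remaining missing values in each column
remaining_na <- colSums(is.na(toy_clm))
print(remaining_na)
```

Mean imputation keeps the column mean unchanged but reduces its variance, so for a heavily imputed variable such as SunD1h the summary statistics should be read with some caution.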

Exploratory Data Analysis and Visualisation

# Load libraries
library(ggplot2) 
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Create time series plots to visualize the trends and variations of different climate variables over time

# Temperature Variation plot with smoothing
temperature_plot <- ggplot(colc_clm23, aes(x = Date)) +
  geom_line(aes(y = TemperatureCAvg, color = "Average Temperature")) +
  geom_line(aes(y = TemperatureCMax, color = "Max Temperature")) +
  geom_line(aes(y = TemperatureCMin, color = "Min Temperature")) +
  geom_smooth(aes(y = TemperatureCAvg), method = "loess", color = "red", se = FALSE) +  # Smoothing
  geom_smooth(aes(y = TemperatureCMax), method = "loess", color = "blue", se = FALSE) +  # Smoothing
  geom_smooth(aes(y = TemperatureCMin), method = "loess", color = "green", se = FALSE) +  # Smoothing
  labs(title = "Temperature Variation",
       x = "Date",
       y = "Temperature (°C)",
       color = "Type") +
  theme_minimal() +  # Apply a minimal theme
  theme(panel.grid = element_blank())  # Remove gridlines

# Visualize precipitation data with smoothing
precipitation_plot <- ggplot(colc_clm23, aes(x = Date, y = Precmm)) +
  geom_col(fill = "blue") +
  geom_smooth(aes(y = Precmm), method = "loess", color = "red", se = FALSE) +  # Smoothing
  labs(title = "Daily Precipitation",
       x = "Date",
       y = "Precipitation (mm)") +
  theme_minimal() +  # Apply a minimal theme
  theme(panel.grid = element_blank())  # Remove gridlines

# Visualize wind speed data with smoothing
wind_speed_plot <- ggplot(colc_clm23, aes(x = Date, y = WindkmhInt)) +
  geom_line(color = "green") +
  geom_smooth(aes(y = WindkmhInt), method = "loess", color = "blue", se = FALSE) +  # Smoothing
  labs(title = "Wind Speed Variation",
       x = "Date",
       y = "Wind Speed (km/h)") +
  theme_minimal() +  # Apply a minimal theme
  theme(panel.grid = element_blank())  # Remove gridlines

# Arrange plots in a 3x1 grid 
grid.arrange(temperature_plot, precipitation_plot, wind_speed_plot, nrow = 3, ncol = 1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The time series plots above show how three groups of climate variables in Colchester varied over 2023: temperature (average, maximum and minimum, in °C), daily precipitation (mm) and wind speed (km/h). The temperature and wind speed plots use lines to show each variable over time, while the precipitation plot uses bars to represent the amount of precipitation on each day. A LOESS-smoothed line is overlaid on each series to highlight the general variation without the high-frequency fluctuations.

The average temperature follows a seasonal pattern, with higher temperatures in the summer months and lower temperatures in the winter months. The smoothed line rises gradually from the start of the year towards summer; because the data cover a single year, this reflects the seasonal cycle rather than evidence of a long-term warming trend. Precipitation varies throughout the year, with some periods, such as the end of July, the beginning of August and the beginning of November, receiving more rain than others. Wind speed also varies throughout the year, with higher speeds at times in January and May and potentially higher speeds in the winter.
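The seasonal pattern described above can also be checked numerically by summarising the daily values by month. The sketch below does this on simulated data (`toy_clm` is generated inside the block and does not contain the real Colchester values); only the column names are taken from the climate dataset.

```r
# Simulated daily data for one year (NOT the real Colchester values)
set.seed(1)
dates <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
doy <- as.numeric(format(dates, "%j"))  # day of year, 1 to 365
toy_clm <- data.frame(
  Date = dates,
  TemperatureCAvg = 11 + 7 * sin(2 * pi * (doy - 105) / 365) + rnorm(length(dates)),
  Precmm = rexp(length(dates), rate = 0.5)
)

# Month number (1-12) for each day
toy_clm$Month <- as.integer(format(toy_clm$Date, "%m"))

# Mean temperature and total precipitation per month
monthly <- merge(
  aggregate(TemperatureCAvg ~ Month, data = toy_clm, FUN = mean),
  aggregate(Precmm ~ Month, data = toy_clm, FUN = sum),
  by = "Month"
)
print(monthly)
```

Applied to the real data, a table like this would make it easier to name the wettest and warmest months than reading them off the plots.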

# Plot time series of TemperatureCMax and TemperatureCMin
temperature_time_series <- ggplot(colc_clm23, aes(Date)) +
  geom_smooth(aes(y = TemperatureCMax, color = "TemperatureCMax"), method = "loess", se = FALSE) + # Add Smoothed Trend Lines
  geom_smooth(aes(y = TemperatureCMin, color = "TemperatureCMin"), method = "loess", se = FALSE) + # Add Smoothed Trend Lines
  labs(x = "Date", y = "Temperature (°C)", color = "Variable") +
  theme_minimal() +
  scale_color_manual(values = c("TemperatureCMax" = "red", "TemperatureCMin" = "blue")) +
  ggtitle("Time Series of Maximum and Minimum Temperatures") +
  theme(plot.title = element_text(size = 12))

# Plot time series of Precmm, SunD1h, WindkmhInt and TotClOct
weather_time_series <- ggplot(colc_clm23, aes(Date)) +
  geom_line(aes(y = Precmm, color = "Precipitation (mm)"), linetype = "dashed") +
  geom_line(aes(y = SunD1h, color = "Sunshine Duration (hours)"), linetype = "dotted") +
  geom_line(aes(y = WindkmhInt, color = "Wind Speed (km/h)")) +
  geom_line(aes(y = TotClOct, color = "Total Cloudiness (octants)")) +
  labs(x = "Date", y = "Value", color = "Variable") +
  theme_minimal() +
  scale_color_manual(values = c("Precipitation (mm)" = "green", "Sunshine Duration (hours)" = "orange", "Wind Speed (km/h)" = "purple", "Total Cloudiness (octants)" = "blue")) +
  ggtitle("Time Series of Precipitation, Sunshine Duration, Wind Speed, and Total Cloudiness") +
  theme(plot.title = element_text(size = 12)) 

# Print the plots one after the other
# (par(mfrow = ...) only affects base R graphics, so it is not used for these ggplot objects)
print(temperature_time_series)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

print(weather_time_series)

The plots above show time series of the maximum and minimum temperatures and of four other weather variables in Colchester over 2023. Smoothing the temperature series helps highlight the general trends without the distraction of high-frequency fluctuations, though it may obscure some of the short-term variations in temperature.

For the temperature plot, the smoothed lines show a clear seasonal pattern, with high temperatures in the summer months and low temperatures in the winter months. This is typical of temperate climates in the Northern Hemisphere. The difference between the maximum and minimum temperatures appears to be larger in the summer months than in the winter months, which suggests that Colchester has a wider daily temperature range in summer and a narrower one in winter.
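The gap between the maximum and minimum temperatures can be measured directly as a daily temperature range and then averaged by month. The following is a minimal sketch on simulated values (not the real Colchester data); the column `TempRange` and the object `range_by_month` are names introduced here for illustration.

```r
# Simulated daily maxima and minima (NOT the real Colchester values)
set.seed(2)
dates <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
n <- length(dates)
t_min <- runif(n, min = 0, max = 12)
toy_clm <- data.frame(
  Date = dates,
  TemperatureCMin = t_min,
  TemperatureCMax = t_min + runif(n, min = 3, max = 12)
)

# Daily temperature range: maximum minus minimum
toy_clm$TempRange <- toy_clm$TemperatureCMax - toy_clm$TemperatureCMin

# Average daily range for each calendar month (labels built from month numbers
# so that the result does not depend on the system locale)
month_num <- as.integer(format(toy_clm$Date, "%m"))
month_label <- factor(month.abb[month_num], levels = month.abb)
range_by_month <- tapply(toy_clm$TempRange, month_label, mean)
print(round(range_by_month, 1))
```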

For the weather variables, precipitation varies throughout the year, with potentially higher amounts in some periods, such as October, than in others. Sunshine duration fluctuates throughout the year, with potentially longer durations in the summer months, based on the higher peaks. Wind speed also varies throughout the year, with possibly higher speeds during some periods in the winter. The total cloudiness line shows variations, with higher values potentially corresponding to periods of lower sunshine duration (indicated by the orange line). This suggests a possible link between cloud cover and sunshine hours, as expected.
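The suggested link between cloud cover and sunshine duration could be quantified with a correlation coefficient. The sketch below uses simulated data in which sunshine is constructed to fall as cloud cover rises, so the negative result is built in and says nothing about the real Colchester values; it only shows the call that could be applied to TotClOct and SunD1h.

```r
# Simulated data in which sunshine falls as cloud cover rises (NOT real values)
set.seed(3)
n <- 365
tot_cl <- runif(n, min = 0, max = 8)
toy_clm <- data.frame(
  TotClOct = tot_cl,
  SunD1h   = pmax(0, 12 - 1.3 * tot_cl + rnorm(n, sd = 1.5))
)

# Spearman rank correlation: does not assume a linear relationship
rho <- cor(toy_clm$TotClOct, toy_clm$SunD1h, method = "spearman")
print(rho)
```

On the real data, the 82 mean-imputed SunD1h values may weaken any correlation, so it could be worth computing it on the rows where SunD1h was originally observed.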

Box plot
# Define function to extract season from date
season <- function(date) {
  months <- as.numeric(format(date, "%m"))
  ifelse(months %in% 3:5, "Spring",
         ifelse(months %in% 6:8, "Summer",
                ifelse(months %in% 9:11, "Autumn", "Winter")))
}

# Create boxplot of total cloud cover by season
ggplot(colc_clm23, aes(x = factor(season(Date)), y = TotClOct)) +
  geom_boxplot(fill = "lightblue", color = "blue") +
  labs(title = "Distribution of Total Cloud Cover by Season",
       x = "Season",
       y = "Total Cloud Cover (Octants)") +
  theme_minimal() +
  ylim(0, max(colc_clm23$TotClOct))  # y-axis from 0 to the maximum observed value

The boxplot visualises the distribution of total cloud cover (TotClOct) across seasons in Colchester. The boxplot shows the spread of TotClOct values for each season (Winter, Spring, Summer, Autumn). The horizontal lines within each box represent the median TotClOct value for that season. The boxes indicate that the distribution of TotClOct can vary across seasons.

The box for Spring has the widest spread, implying high variability in TotClOct values during this season. The Summer box has a narrower spread than Spring, indicating less variation in TotClOct during the summer months. The Autumn and Winter boxes have similar spreads. The whiskers extending from the boxes represent the range of TotClOct values within 1.5 times the interquartile range (IQR) from the quartiles. Any data points beyond the whiskers are treated as outliers and plotted as individual points.

The boxplot alone does not definitively show which season has the highest or lowest total cloud cover. However, it suggests that Spring might have the most variable cloud cover, while Summer might have a more consistent pattern. To compare typical cloud cover across seasons, we can examine the relative positions of the median lines (the horizontal lines within the boxes): Spring's median appears to be the highest, followed by Winter and Autumn, while Summer appears to have the lowest median TotClOct.
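Reading medians and spreads off a boxplot is approximate, so a numeric summary by season can back up the comparison. The sketch below repeats the `season()` rule defined earlier in the report and applies it to simulated cloud cover values (not the real data); `season_summary` is a name introduced here for illustration.

```r
# Same season rule as the season() function defined earlier in the report
season <- function(date) {
  months <- as.numeric(format(date, "%m"))
  ifelse(months %in% 3:5, "Spring",
         ifelse(months %in% 6:8, "Summer",
                ifelse(months %in% 9:11, "Autumn", "Winter")))
}

# Simulated cloud cover values (NOT the real Colchester data)
set.seed(4)
toy_clm <- data.frame(
  Date = seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
)
toy_clm$TotClOct <- runif(nrow(toy_clm), min = 0, max = 8)
toy_clm$Season <- season(toy_clm$Date)

# Median and interquartile range of cloud cover for each season
season_summary <- aggregate(
  TotClOct ~ Season, data = toy_clm,
  FUN = function(x) c(median = median(x), iqr = IQR(x))
)
print(season_summary)
```

With the real dataset, the `median` column would settle which season has the highest and lowest typical cloud cover, and the `iqr` column would show which season is the most variable.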

Conclusion

The analysis and visualisation revealed several interesting features in the datasets. However, both datasets cover only a single year, so it is difficult to draw conclusions about long-term trends. Data from additional years would allow us to see whether the patterns observed in 2023 are consistent with historical trends or represent an anomaly.
